Understand the cognitive and psychological underpinnings of perceiving data visualization.
Identify good data visualizations and describe what makes them good.
Produce data visualizations according to types of questions, data, and more, with a particular emphasis on building a toolbox that you can carry into your own research.
Course Expectations
~50% of the course will be in R
You will get the most from this course if you:
have your own data you can apply course content to
know how to clean clean, transform, and manage that data
today’s workshop is a good litmus test for this
Course Materials
All materials (required and optional) are free and online
Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.
1. %>%
The pipe %>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43.
Code
sqrt(43)
[1] 6.557439
Code
round(sqrt(43), 2)
[1] 6.56
The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.
Code
round(sqrt(43/2), 2)
[1] 4.64
The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:
Code
sqrt(43) %>%round(2)
[1] 6.56
2. filter()
Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.
Code
data(bfi) # grab the bfi data from the psych packagebfi <- bfi %>%as_tibble()head(bfi)
Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.
Code
summary(bfi$age) # get age descriptives
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 20.00 26.00 28.78 35.00 86.00
Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.
Code
bfi2 <- bfi %>%# see a pipe!filter(age <=18) # filter to age up to 18summary(bfi2$age) # summary of the new data
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 16.0 17.0 16.3 18.0 18.0
But this isn’t quite right. We still have folks below 12. But, the beauty of filter() is that you can do sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.
Code
bfi2 <- bfi %>%filter(age <=18& age >=12) # filter to age up to 18 and at least 12summary(bfi2$age) # summary of the new data
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.0 16.0 17.0 16.4 18.0 18.0
Got it!
But filter works for more use cases than just conditional
<, >, <=, and >=
It can also be used for cases where we want a single values to match cases with text.
To do that, let’s convert one of the variables in the bfi data frame to a string.
So let’s change gender (1 = male, 2 = female) to text (we’ll get into factors later).
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.0 16.0 17.0 16.4 18.0 18.0
3. select()
If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns).
it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.
In our bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
With select(), there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of select() helper functions.
Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it
Although there is a base Rsort() function, the arrange() function is tidyverse version that plays nicely with other tidyverse functions.
So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)
Code
# sort by agebfi %>%select(gender:age) %>%arrange(age) %>%print(n =6)
# A tibble: 2,800 × 3
gender education age
<int> <chr> <int>
1 1 Higher Degree 3
2 2 <NA> 9
3 2 Some College 11
4 2 <NA> 11
5 2 <NA> 11
6 2 <NA> 12
# ℹ 2,794 more rows
Code
# sort by educationbfi %>%select(gender:age) %>%arrange(education) %>%print(n =6)
---title: "Week 1 (Workbook) - Getting Situated in R and tidyverse"author: "Emorie D Beck"format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: true # height: 900 footer: "PSC 290 - Data Visualization in R" logo: "https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/ucdavis_logo_blue.png"editor: visualeditor_options: chunk_output_type: console---```{r}library(knitr)library(psych)library(plyr)library(tidyverse)```## Course Goals & Learning Outcomes1. Understand the cognitive and psychological underpinnings of perceiving data visualization.2. Identify good data visualizations and describe what makes them good.3. Produce data visualizations according to types of questions, data, and more, with a particular emphasis on building a toolbox that you can carry into your own research.## Course Expectations- \~50% of the course will be in R- You will get the most from this course if you: - have your own data you can apply course content to - know how to clean clean, transform, and manage that data - today's workshop is a good litmus test for this## Course Materials- All materials (required and optional) are free and online - Wickham & Grolemond: *R for Data Science* <https://r4ds.had.co.nz> - Wickham: *Advanced R* <http://adv-r.had.co.nz> - Wilke: *Fundamentals of Data Visualization* <https://clauswilke.com/dataviz/> - Healy: *Data Visualization: A Practical Introduction* <https://socviz.co> - [Data Camp](https://www.datacamp.com/groups/shared_links/bd4b5bf66c2195741f388ca82c1b19c89b6f37754c7cfa0eb5aded5f16678360): All paid content unlocked## Assignments| | ||----------------------------------|-------------|| **Assignment Weights** | **Percent** || Problem Sets (5) | 20% || Response Papers + Visualizations | 20% || Final Project Proposal | 10% || Class Presentation | 20% || Final Project | 30% || **Total** | **100%** |### Response Papers / Visualizations- The main homework in the course is your weekly visualization assignment- The goal is to demonstrate how the principles and skills you learn in the class function "in the wild."- These should be fun and not taken too seriously! No one is judging you for a pulling a graphic from Instagram instead of Nature.- Due 12:00 PM the day before (i.e. Tuesday) class (last class is "free points")### Problem Sets- About every other week, there will be a practice set to help you practice what you're learning.- These will have you apply the code you've been learning to your own data or a provided data set- Assigning them every other week aims to reduce burden while still allowing you to practice- Frequency / form will be adjusted as needed throughout the quarter### Final Projects- I don't want you to walk out of this course and not know how to apply what you learned- Final project replaces final exam (there are no exams)- Create a visualization for an ongoing project! - Stage 1: Proposal (due 02/12/25) - Stage 2: 1-on-1 meetings + feedback (due by 02/26/25) - Stage 3: In-class presentations (03/12/25) - Stage 4: Final visualization + brief write-up (due 03/19/25 at midnight)### Extra Credit- Lots of talk series, etc. this winter- 1 pt extra credit for each one you: - go to, - take a snap of a data viz, - and critique it according to what you've learned in class- max 5 pts## Class Time- \~5-10 min: welcome and review (if needed)- \~20-35 min: discussion / some lecture content on readings- \~5-10 min: break- \~40-60 min: workshop- \~20-30 min: open lab## Course Topics::::: columns::: column1. Intro and Overview2. Cognitive Perspectives3. Proportions and Probability4. Differences and Associations5. Change and Time Series6. Uncertainty7. Piecing Visualizations Together:::::: column8. Polishing Visualizations9. Interactive Visualizations Additional Topics:- Spatial Information- Automated Reports- Diagrams- More?::::::::# Questions on the Syllabus and Course?# Data Visualization## Why Should I Care About Data Visualization- Summarizing huge amounts of information- Seeing the forest and the trees- Errors in probabilistic reasoning- It's fun!## Why Visualize Data in R::::: columns::: {.column width="60%"}- Open source and freeeee- Flexible- Reproducible- Flexible formatting / output- Lots of model- and package-specific support- Did I mention free?:::::: {.column width="40%"}```{r}knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/R_logo.svg.png")```::::::::## Why Use RStudio (Pivot)::::: columns::: {.column width="60%"}- Also free- Basically a GUI for R- Organize files, import data, etc. with ease- RMarkdown, Quarto, and more are powerful tools (they were used to create these slides!)- Lots of new features and support:::::: {.column width="40%"}```{r}knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/RStudio-Logo-Flat.png")```::::::::## Why Use the `tidyverse`::::: columns::: {.column width="70%"}- Maintained by RStudio (Pivot)- No one should have to use a for loop to change data from long to wide- Tons of integrated tools for data cleaning, manipulation, transformation, and visualization- Even increasing support for modeling (e.g., `tidymodels`):::::: {.column width="30%"}```{r, fig.align='center', echo = F}knitr::include_graphics("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/tidyverse.png")```::::::::::: {layout="[[1,1, 1, 1], [1, 1, 1,1], [1, 1, 1,1]]"}```{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/tidyr.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/stringr.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/shiny.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/rmarkdown.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/quarto.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/knitr.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/ggplot2.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/forcats.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/dplyr.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/broom.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/tibble.png")``````{r, fig.align='center', echo = F, out.width = "60%"}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/purrr.png")```:::## Goals for Today- **Review** core principles of: - `dyplr` (data manipulation) - `tidyr` (data transformation and reshaping)::::: {.columns style="display: flex !important; height: 90%;"}::: {.column width="70%" style="display: flex; align-items: center;"}<!-- <p style="font-size:160%;"> --># Data Manipulation in `dplyr`<!-- </p> -->:::::: {.column width="30%" style="display: flex; justify-content: center; align-items: center;"}```{r, fig.align='center', echo = F}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/main/thumbs/dplyr.png")```::::::::# `dplyr` Core Functions1. **`%>%`**: The pipe. Read as "and then."2. **`filter()`**: Pick observations (rows) by their values.3. **`select()`**: Pick variables (columns) by their names.4. **`arrange()`**: Reorder the rows.5. **`group_by()`**: Implicitly split the data set by grouping by names (columns).6. **`mutate()`**: Create new variables with functions of existing variables.7. **`summarize()` / `summarise()`**: Collapse many values down to a single summary.## Core Functions:::::: columns:::: {.column width="40%"}::: nonincremental1. **`%>%`**2. **`filter()`**3. **`select()`**4. **`arrange()`**5. **`group_by()`**6. **`mutate()`**7. **`summarize()`**:::::::::: {.column width="60%" style="text-align: center; background-color: #FFD966; color: black; border: 5px solid #033266;"}Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I'll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.:::::::::## 1. `%>%`- The pipe `%>%` is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43.```{r, echo = T}sqrt(43)round(sqrt(43), 2)```The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.```{r, echo = T}round(sqrt(43/2), 2)```The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of `%>%` reads and operates as "and then." So with the rounded square root of 43, for example:```{r, echo = T}sqrt(43) %>%round(2)```## 2. `filter()`Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include.<!-- Say for example, that you're interested personality change in adolescence, but you just opened a survey up online. So when you actually download and examine your data, you realize that you have an age range of something like 3-86, not 12-18. In this case, you want to get rid of the people over 18 -- that is, `filter()` them out. -->```{r, echo=TRUE}data(bfi) # grab the bfi data from the psych packagebfi <- bfi %>%as_tibble()head(bfi)```Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include.```{r, echo = T}summary(bfi$age) # get age descriptives```Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don't want to include.```{r, echo = T}#| code-line-numbers: "|2"bfi2 <- bfi %>%# see a pipe!filter(age <=18) # filter to age up to 18summary(bfi2$age) # summary of the new data ```But this isn't quite right. We still have folks below 12. But, the beauty of `filter()` is that you can do sequence of `OR` and `AND` statements when there is more than one condition, such as up to 18 `AND` at least 12.```{r, echo = T}bfi2 <- bfi %>%filter(age <=18& age >=12) # filter to age up to 18 and at least 12summary(bfi2$age) # summary of the new data ```Got it!- But filter works for more use cases than just conditional - `<`, `>`, `<=`, and `>=`- It can also be used for cases where we want a single values to match cases with text.- To do that, let's convert one of the variables in the `bfi` data frame to a string.- So let's change gender (1 = male, 2 = female) to text (we'll get into factors later).```{r, echo = T}bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))```Now let's try a few things:<fontcolor="#033266">**1. Create a data set with only individuals with some college (`==`).**</font>```{r, echo = T}bfi2 <- bfi %>%filter(education =="Some College")unique(bfi2$education)```<fontcolor="#033266">**2. Create a data set with only people age 18 (`==`).**</font>```{r, echo = T}bfi2 <- bfi %>%filter(age ==18)summary(bfi2$age)```<fontcolor="#033266">**3. Create a data set with individuals with some college or above (`%in%`).**</font>```{r, echo = T}bfi2 <- bfi %>%filter(education %in%c("Some College", "College", "Higher Degree"))unique(bfi2$education)````%in%` is great. It compares a column to a vector rather than just a single value, you can compare it to several```{r, echo = T}bfi2 <- bfi %>%filter(age %in%12:18)summary(bfi2$age)```## 3. `select()`- If `filter()` is for pulling certain observations (rows), then `select()` is for pulling certain variables (columns).- it's good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.- In our `bfi` data, most of these have been pre-removed, so instead, we'll imagine we don't want to use any indicators of Agreeableness (A1-A5) and that we aren't interested in gender.- With `select()`, there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of `select()` helper functions.### A. Bare quote columns we want to keep:::::: columns::: column```{r, echo = T}#| code-line-numbers: "|2"bfi %>%select(C1, C2, C3, C4, C5) %>%print(n =6)```:::::: column```{r, echo=T}#| code-line-numbers: "|2"bfi %>%select(C1:C5) %>%print(n =6)```<!-- We can also use `:` to grab a *range* of columns. -->::::::::### B. Bare quote columns we don't want to keep:```{r, echo = T}#| code-line-numbers: "|2"bfi %>%select(-(A1:A5), -gender) %>%# Note the `()` around the columnsprint(n =6)```### C. Add or remove using `select()` helper functions.:::::: columns::: {.column width="40%"}- `starts_with()`\- `ends_with()`- `contains()`- `matches()`- `num_range()`- `one_of()`- `all_of()`::::::: {.column width="60%"}::: fragment```{r, echo = T}bfi %>%select(starts_with("C"))```:::::::::::::## 4. `arrange()`- Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it- Although there is a base `R``sort()` function, the `arrange()` function is `tidyverse` version that plays nicely with other `tidyverse functions`.::::: columnsSo in our previous examples, we could also `arrange()` our data by age or education, rather than simply filtering. (Or as we'll see later, we can do both!)::: {.column width="50%"}```{r, echo = T}#| code-line-numbers: "|4"# sort by agebfi %>%select(gender:age) %>%arrange(age) %>%print(n =6)```:::::: {.column width="50%"}```{r, echo=TRUE}#| code-line-numbers: "|4"# sort by educationbfi %>%select(gender:age) %>%arrange(education) %>%print(n =6)```::::::::We can also arrange by multiple columns, like if we wanted to sort by gender then education:```{r, echo = T}bfi %>%select(gender:age) %>%arrange(gender, education) %>%print(n =6)```# Bringing it all together: Split-Apply-Combine- Much of the power of `dplyr` functions lay in the split-apply-combine method- A given set of of data are: - *split* into smaller chunks - then a function or series of functions are *applied* to each chunk - and then the chunks are *combined* back together## 5. `group_by()`- The `group_by()` function is the "split" of the method- It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments.So imagine that we wanted to `group_by()` education levels to get average ages at each level```{r, echo = T}bfi %>%select(starts_with("C"), age, gender, education) %>%group_by(education) %>%print(n =6)```- Hadley's first law of data cleaning: "What is split, must be combined"- This is super easy with the `ungroup()` function:```{r, echo=TRUE}bfi %>%select(starts_with("C"), age, gender, education) %>%group_by(education) %>%ungroup() %>%print(n =6)```Multiple `group_by()` calls overwrites previous calls:```{r, echo = T}bfi %>%select(starts_with("C"), age, gender, education) %>%group_by(education) %>%group_by(gender, age) %>%print(n =6)```## 6. `mutate()`- `mutate()` is one of your "apply" functions- When you use `mutate()`, the resulting data frame will have the same number of rows you started with- You are directly mutating the existing data frame, either modifying existing columns or creating new onesTo demonstrate, let's add a column that indicated average age levels within each age group```{r, echo = T}bfi %>%select(starts_with("C"), age, gender, education) %>%arrange(education) %>%group_by(education) %>%mutate(age_by_edu =mean(age, na.rm = T)) %>%print(n =6)````mutate()` is also super useful even when you aren't groupingWe can create a new category```{r, echo = T}bfi %>%select(starts_with("C"), age, gender, education) %>%mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))```We could also just overwrite it:```{r, echo = T}bfi %>%select(starts_with("C"), age, gender, education) %>%mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))```## 7. `summarize()` / `summarise()`- `summarize()` is one of your "apply" functions- The resulting data frame will have the same number of rows as your grouping variable- You number of groups is 1 for ungrouped data frames```{r, echo = T}# group_by() educationbfi %>%select(starts_with("C"), age, gender, education) %>%arrange(education) %>%group_by(education) %>%summarize(age_by_edu =mean(age, na.rm = T)) ``````{r, echo = T}# no groups bfi %>%select(starts_with("C"), age, gender, education) %>%arrange(education) %>%summarize(age_by_edu =mean(age, na.rm = T)) ```::::: {.columns .v-center-container}::: {.column width="70%"}<pstyle="font-size:160%;">Data Wrangling in `tidyr`</p>:::::: {.column width="30%"}```{r, fig.align='center'}knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/master/thumbs/tidyr.png")```::::::::# `tidyr`- Now, let's build off what we learned from dplyr and focus on *reshaping* and *merging* our data.- First, the reshapers:1. `pivot_longer()`, which takes a "wide" format data frame and makes it long.\2. `pivot_wider()`, which takes a "long" format data frame and makes it wide.- Next, the mergers:3. `full_join()`, which merges *all* rows in either data frame\4. `inner_join()`, which merges rows whose keys are present in *both* data frames\5. `left_join()`, which "prioritizes" the first data set\6. `right_join()`, which "prioritizes" the second data set(See also:`anti_join()` and `semi_join()`)# Key `tidyr` Functions## 1. `pivot_longer()`- (Formerly `gather()`) Makes wide data long, based on a key <fontsize="5">- Core arguments: - `data`: the data, blank if piped - `cols`: columns to be made long, selected via `select()` calls - `names_to`: name(s) of key column(s) in new long data frame (string or string vector) - `values_to`: name of values in new long data frame (string) - `names_sep`: separator in column headers, if multiple keys - `values_drop_na`: drop missing cells (similar to `na.rm = T`) </font>### Basic ApplicationLet's start with an easy one -- one key, one value:```{r, echo=TRUE}bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to ="item" , values_to ="values" , values_drop_na = T ) %>%print(n =8)```### More Advanced ApplicationNow a harder one -- two keys, one value:```{r, echo=TRUE}bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to =c("trait", "item_num") , names_sep =-1 , values_to ="values" , values_drop_na = T ) %>%print(n =8)```## 2. `pivot_wider()` {.smaller}- (Formerly `spread()`) Makes wide data long, based on a key <fontsize="6">- Core arguments: - `data`: the data, blank if piped - `names_from`: name(s) of key column(s) in new long data frame (string or string vector) - `names_sep`: separator in column headers, if multiple keys - `names_glue`: specify multiple or custom separators of multiple keys - `values_from`: name of values in new long data frame (string) - `values_fn`: function applied to data with duplicate labels </font>### Basic Application```{r}bfi_long <- bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to ="item" , values_to ="values" , values_drop_na = T )```### More Advanced```{r, results = 'hide'}bfi_long <- bfi %>%rownames_to_column("SID") %>%pivot_longer(cols = A1:O5 , names_to =c("trait", "item_num") , names_sep =-1 , values_to ="values" , values_drop_na = T )``````{r, echo = T}bfi_long %>%pivot_wider(names_from =c("trait", "item_num") , values_from ="values" , names_sep ="_" )```### A Little More Advanced```{r, echo = T}bfi_long %>%select(-item_num) %>%pivot_wider(names_from ="trait" , values_from ="values" , names_sep ="_" , values_fn = mean )```# More `dplyr` Functions## The `_join()` Functions- Often we may need to pull different data from different sources- There are lots of reasons to need to do this- We don't have time to get into all the use cases here, so we'll talk about them in high level terms- We'll focus on: - `full_join()` - `inner_join()` - `left_join()` - `right_join()`- Let's separate demographic and BFI data```{r, echo = T}#| code-line-numbers: "|3"bfi_only <- bfi %>%rownames_to_column("SID") %>%select(SID, matches("[0-9]"))bfi_only %>%as_tibble() %>%print(n =6)``````{r, echo = T}#| code-line-numbers: "|3"bfi_dem <- bfi %>%rownames_to_column("SID") %>%select(SID, education, gender, age)bfi_dem %>%as_tibble() %>%print(n =6)```Before we get into it, as a reminder, this is what the data set looks like before we do any joining:```{r, echo = T}bfi %>%rownames_to_column("SID") %>%as_tibble() %>%print(n =6)```## 3. `full_join()`Most simply, we can put those back together keeping all observations.```{r, echo = T}bfi_only %>%full_join(bfi_dem) %>%as_tibble() %>%print(n =6)```## 4. `inner_join()`We can also keep all rows present in *both* data frames```{r, echo = T}#| code-line-numbers: "|1-2|4-5|3"bfi_dem %>%filter(row_number() %in%1:1700) %>%inner_join( bfi_only %>%filter(row_number() %in%1200:2800) ) %>%as_tibble() %>%print(n =6)```## 5. `left_join()`Or all rows present in the left (first) data frame, perhaps if it's a subset of people with complete data```{r, echo = T}#| code-line-numbers: "|2|3"bfi_dem %>%drop_na() %>%left_join(bfi_only) %>%as_tibble() %>%print(n =6)```## 6. `right_join()`Or all rows present in the right (second) data frame, such as I do when I join a codebook with raw data```{r, echo = T}#| code-line-numbers: "|3"bfi_dem %>%drop_na() %>%right_join(bfi_only) %>%as_tibble() %>%print(n =6)```